In this study, we examined yeast proteins by two-dimensional (2D) gel electrophoresis and gathered quantitative
information from about 1,400 spots. We found that there is an enormous range of protein abundance
and, for identified spots, a good correlation between protein abundance, mRNA abundance, and codon bias. 
For each molecule of well-translated mRNA, there were about 4,000 molecules of protein. The relative 
abundance of proteins was measured in glucose and ethanol media. Protein turnover was examined and found 
to be insignificant for abundant proteins. Some phosphoproteins were identified. The behavior of proteins in 
differential centrifugation experiments was examined. Such experiments with 2D gels can give a global view of 
the yeast proteome. 



The sequence of the yeast genome has been determined (9). 
More recently, the number of mRNA molecules for each ex- 
pressed gene has been measured (27, 30). The next logical level 
of analysis is that of the expressed set of proteins. We have 
begun to analyze the yeast proteome by using two-dimensional 
(2D) gels. 

2D gel electrophoresis separates proteins according to iso- 
electric point in one dimension and molecular weight in the 
other dimension (21), allowing resolution of thousands of pro- 
teins on a single gel. Although modern imaging and computing 
techniques can extract quantitative data for each of the spots in 
a 2D gel, there are only a few cases in which quantitative data 
have been gathered from 2D gels. 2D gel electrophoresis is 
almost unique in its ability to examine biological responses 
over thousands of proteins simultaneously and should there- 
fore allow us a relatively comprehensive view of cellular me- 
tabolism. 

We and others have worked toward assembling a yeast pro- 
tein database consisting of a collection of identified spots in 2D 
gels and of data on each of these spots under various condi- 
tions (2, 7, 8, 10, 23, 25). These data could then be used in 
analyzing a protein or a metabolic process. Saccharomyces 
cerevisiae is a good organism for this approach since it has a 
well-understood physiology as well as a large number of mu- 
tants, and its genome has been sequenced. Given the sequence 
and the relative lack of introns in S. cerevisiae, it is easy to 
predict the sequence of the primary protein product of most 
genes. This aids tremendously in identifying these proteins on 
2D gels. 

There are three pillars on which such a database rests: (i) 
visualization of many protein spots simultaneously, (ii) quan- 
tification of the protein in each spot, and (iii) identification of 
the gene product for each spot. Our first efforts at visualization
and identification for S. cerevisiae have been described elsewhere
(7, 8). Here we describe quantitative data for these
proteins under a variety of experimental conditions. 

MATERIALS AND METHODS 
Strains and media. S. cerevisiae W303 (MATa ade2-1 his3-11,15 leu2-3,112
trp1-1 ura3-1 can1-100) was used (26). -Met YNB (yeast nitrogen base) medium
was 1.7 g of YNB (Difco) per liter, 5 g of ammonium sulfate per liter, and
adenine, uracil, and all amino acids except methionine; -Met -Cys YNB medium
was the same but without methionine or cysteine. Medium was supplemented
with 2% glucose (for most experiments) or with 2% ethanol (for ethanol
experiments). Low-phosphate YEPD was described by Warner (28).

Isotopic labeling of yeast and preparation of cell extracts. Yeast strains were
labeled and proteins were extracted as described by Garrels et al. (7, 8). Briefly,
cells were grown to 5 × 10^6 cells per ml at 30°C; 1 ml of culture was transferred
to a fresh tube, and 0.3 mCi of [35S]methionine (e.g., Express protein labeling
mix; New England Nuclear) was added to this 1-ml culture. The cells were
incubated for a further 10 to 15 min and then transferred to a 1.5-ml microcentrifuge
tube, chilled on ice, and harvested by centrifugation. The supernatant was
removed, and the cell pellet was resuspended in 100 µl of lysis buffer (20 mM
Tris-HCl [pH 7.6], 10 mM NaF, 10 mM sodium pyrophosphate, 0.5 mM EDTA,
0.1% deoxycholate; just before use, phenylmethylsulfonyl fluoride was added to
1 mM, leupeptin was added to 1 µg/ml, pepstatin was added to 1 µg/ml, tosylsulfonyl
phenylalanyl chloromethyl ketone was added to 10 µg/ml, and soybean
trypsin inhibitor was added to 10 µg/ml).

The resuspended cells were transferred to a screw-cap 1.5-ml polypropylene 
tube containing 0.28 g of glass beads (0.5-mm diameter; Biospec Products) or 
0.40 g of zirconia beads (0.5-mm diameter; Biospec Products). After the cap was 
secured, the tube was inserted into a MiniBeadbeater 8 (Biospec Products) and 
shaken at medium high speed at 4°C for 1 min. Breakage was typically 75%.
Tubes were then spun in a microcentrifuge for 10 s at 5,000 × g at 4°C.

With a very fine pipette tip, liquid was withdrawn from the beads and transferred
to a prechilled 1.5-ml tube containing 7 µl of DNase I (0.5 mg/ml; Cooper
product no. 6330)-RNase A (0.25 mg/ml; Cooper product no. 5679)-Mg (50 mM
MgCl2) mix. Typically 70 µl of liquid was recovered. The mixture was incubated
on ice for 10 min to allow the RNase and DNase to work. 

Next, 75 µl of 2× dSDS (2× dSDS is 0.6% sodium dodecyl sulfate [SDS], 2%
mercaptoethanol, and 0.1 M Tris-HCl [pH 8]) was added. The tube was plunged
into boiling water, incubated for 1 min, and then plunged into ice. After cooling,
the tube was centrifuged at 4°C for 3 min at 14,000 × g. The supernatant was
transferred to a fresh tube and frozen at -70°C. About 5 µl of this supernatant
was used for each 2D gel.

2D polyacrylamide gels. 2D gels were made and run as described elsewhere
(6-8).

Image analysis of the gels. The Quest II software system was used for quan- 
titative image analysis (20, 22). Two techniques were used to collect quantitative 
data for analysis by Quest II software. First, before the advent of phosphorim- 
agers, gels were dried and fluorographed. Each gel was exposed to film for three 
different times (typically 1 day, 2 weeks, and 6 weeks) to increase the dynamic 
range of the data. The films were scanned along with calibration strips to relate 
film optical density to disintegrations per minute in the gels and analyzed by the 
software to obtain a linear relationship between disintegrations per minute in the 
spots and optical densities of the film images. The quantitative data are ex- 
pressed as parts per million of the total cellular protein. This value is calculated 
from the disintegrations per minute of the sample loaded onto the gel and by 
comparing the film density of each data spot with density of the film over the 
calibration strips of known radioactivity exposed to the same film. This yields the 
disintegrations per minute per millimeter for each spot on the gel and thence its 
parts-per-million value.

After the advent of phosphorimaging, gels bearing 35S-labeled proteins were
exposed to phosphorimager screens and scanned by a Fuji phosphorimager,
typically for two exposures per gel. Calibration strips of known radioactivity were
exposed simultaneously. Scan data from the phosphorimager were assimilated by
Quest II software, and quantitative data were recorded for the spots on the gels.

Measurements of protein turnover. Cells in exponential phase were pulse-labeled
with [35S]methionine, excess cold Met and Cys were added, and samples
of equal volume were taken from the culture at intervals up to 90 min (in one
experiment) or up to 160 min (in a second experiment). Incorporation of 35S into
protein was essentially 100% by the first sample (10 min). Extracts were made,
and equal fractions of the samples were loaded on 2D gels (i.e., the different
samples had different amounts of protein but equal amounts of 35S). Spots were
quantitated with a phosphorimager and Quest software.

The software was queried for spots whose radioactivity decreased through the 
time course. The algorithm examined all data points for all spots, drew a best-fit 
line through the data points, and looked for spots where this line had a statis- 
tically significant negative slope. In one of the experiments, there was one such 
spot. To the eye, this was a minor, unidentified spot seen only in the first two 
samples (10 and 20 min). In the other experiment, the Quest software found no 
spots meeting the criteria. Therefore, we concluded that all of the identified
spots (and all but one of the visible spots) represented proteins with long
half-lives.
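
The slope query described above can be sketched as follows. This is a minimal illustration, not the Quest software's actual algorithm, which is not described in detail here. It fits an ordinary least-squares line to each spot's time course and flags the spot when the slope is negative and its t statistic falls below a chosen critical value. The function names, the default critical value (the one-sided 5% value for six time points), and the example data are all assumptions.

```python
import math

def fit_line(times, values):
    """Ordinary least-squares fit; returns (slope, intercept, t statistic of the slope)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    # Residual sum of squares around the fitted line.
    rss = sum((v - (intercept + slope * t)) ** 2 for t, v in zip(times, values))
    if rss == 0:
        # Perfect fit: the t statistic is unbounded unless the slope is zero.
        t_stat = math.copysign(math.inf, slope) if slope != 0 else 0.0
    else:
        std_err = math.sqrt(rss / (n - 2) / sxx)
        t_stat = slope / std_err
    return slope, intercept, t_stat

def is_decaying(times, values, t_crit=2.132):
    """True if the best-fit slope is significantly negative.

    t_crit = 2.132 is the one-sided 5% critical value for 4 degrees of
    freedom (six time points); it is an assumed threshold.
    """
    slope, _, t_stat = fit_line(times, values)
    return slope < 0 and t_stat < -t_crit

# Hypothetical time courses (minutes of chase, counts in one spot).
times = [10, 20, 30, 45, 60, 90]
stable_spot = [100, 98, 103, 101, 99, 102]
unstable_spot = [100, 80, 65, 40, 22, 8]
print(is_decaying(times, stable_spot), is_decaying(times, unstable_spot))
```

A spot that is present only in the earliest samples, like the one minor spot found in the first experiment, would be flagged by this kind of test, whereas a spot with constant counts would not.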

Centrifugal fractionation. Cells were labeled, harvested, and broken with glass
beads by the standard method described above except that no detergent (i.e., no 
deoxycholate) was present in the lysis buffer. The crude lysate was cleared of 
unbroken cells and large debris by centrifugation at 300 x g for 30 s. The 
supernatant of this centrifugation was then spun at 16,000 X g for 10 min to give 
the pellet used for Fig. 6B. The supernatant of the 16,000 x g, 10-min spin was 
then spun at 100,000 X g for 30 min to give the supernatant used for Fig. 6A. 

Protein abundance calculations. A haploid yeast cell contains about 4 × 10^-12
g of protein (1, 15). Assuming a mean protein mass of 50 kDa, there are about
50 × 10^6 molecules of protein per cell. There are about 1.8 methionines per 10
kDa of protein mass, which implies 4.5 × 10^8 molecules of methionine per cell
(neglecting the small pool of free Met). We measured (i) the counts per minute
in each spot on the 2D gels, (ii) the total number of counts on each gel (by
integrating counts over the entire gel), and (iii) the total number of counts
loaded on the gel (by scintillation counting of the original sample). Thus, we
know what fraction of the total incorporated radioactivity is present in each spot.
After correcting for the methionine (and cysteine [see below]) content of each
protein, we calculated an absolute number of protein molecules based on the
fraction of radioactivity in each spot and on 50 × 10^6 total molecules per cell.

The labeling mixture used contained about one-fifth as much radioactive
cysteine as radioactive methionine. Therefore, the number of cysteine molecules 
per protein was also taken into account in calculating the number of molecules 
of protein, but Cys molecules were weighted one-fifth as heavily as Met mole- 
cules. 
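
The calculation described in the two paragraphs above can be sketched as follows. The constants come from the text; the function names and the mean cysteine content passed in as `mean_cys` are assumptions made for illustration, since the text does not state an average cysteine content.

```python
TOTAL_PROTEINS_PER_CELL = 50e6   # from the text: about 50 x 10^6 molecules per cell
MEAN_PROTEIN_KDA = 50.0          # from the text: assumed mean protein mass
MET_PER_10_KDA = 1.8             # from the text
CYS_WEIGHT = 1 / 5               # Cys label is about one-fifth of Met label

def mean_met_per_protein():
    """Average number of methionines in a 50-kDa protein (about 9)."""
    return MET_PER_10_KDA * MEAN_PROTEIN_KDA / 10.0

def label_weight(n_met, n_cys):
    """Relative labeling of one protein molecule, in Met equivalents."""
    return n_met + CYS_WEIGHT * n_cys

def molecules_per_cell(fraction_of_counts, n_met, n_cys, mean_cys):
    """Molecules per cell for a spot holding `fraction_of_counts` of the label.

    mean_cys is a hypothetical average cysteine content per protein.
    """
    mean_weight = label_weight(mean_met_per_protein(), mean_cys)
    total_label = TOTAL_PROTEINS_PER_CELL * mean_weight   # Met equivalents per cell
    return fraction_of_counts * total_label / label_weight(n_met, n_cys)

# Total methionine per cell implied by the text's numbers.
print(mean_met_per_protein() * TOTAL_PROTEINS_PER_CELL)

# A spot with 0.1% of the counts (1,000 ppm) from a protein of average composition.
print(molecules_per_cell(0.001, n_met=9, n_cys=5, mean_cys=5))
```

For a protein of average composition the result reduces to the fraction of counts times 50 × 10^6; a protein richer in methionine than average yields proportionally fewer molecules for the same counts.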

mRNA abundance calculations. For estimation of mRNA abundance, we used
SAGE (serial analysis of gene expression) data (27) and Affymetrix chip hybridization
data (29a, 30). The mRNA column in Table 1 shows mRNA abundance
calculated from SAGE data alone. However, the SAGE data came from cells
growing in YEPD medium, whereas our protein measurements were from cells
growing in YNB medium. In addition, SAGE data for low-abundance mRNAs
suffer from statistical variation. Therefore, we also used chip hybridization data
(29a, 30) for mRNA from cells grown in YNB. These hybridization data also had
disadvantages. First, the amounts of high-abundance mRNAs were systematically
underestimated, probably because of saturation in the hybridizations, which
used 10 µg of cRNA. For example, the abundance of ADH1 mRNA was 197
copies per cell by SAGE but only 32 copies per cell by hybridization, and the
abundance of ENO2 mRNA was 248 copies per cell by SAGE but only 41 by
hybridization. When the amount of cRNA used in the hybridization was reduced
to 1 µg, the apparent amounts of mRNA were similar to the amounts determined
by SAGE (29a, 29b). However, experiments using 1 µg of cRNA have been done
for only some genes (29a). Because amounts of mRNA were normalized to
15,000 per cell, and because the amounts of abundant mRNAs were underestimated,
there is a 2.2-fold overestimate of the abundance of nonabundant
mRNAs. We calculated this factor of 2.2 by adding together the number of
mRNA molecules from a large number of genes expressed at a low level for both
SAGE data and hybridization data. The sum for the same genes from hybridization
data is 2.2-fold greater than that from SAGE data.

To take into account these difficulties, we compiled a list of "adjusted" mRNA 
abundance as follows. For all high-abundance mRNAs of our identified proteins, 
we used SAGE data. For all of these particular mRNAs, chip hybridization 
suggested that mRNA abundance was the same in YEPD and YNB media. For 
medium-abundance mRNAs, SAGE data were used, but when hybridization 
data showed a significant difference between YEPD and YNB, then the SAGE 
data were adjusted by the appropriate factor. Finally, for low-abundance 
mRNAs, we used data from chip hybridizations from YNB medium but divided 
by 2.2 to normalize to the SAGE results. These calculations were completed 
without reference to protein abundance. 
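
The adjustment rules above can be written as a small decision function. This is only a sketch: the text does not give the cutoffs that separate high-, medium-, and low-abundance mRNAs, so the abundance class is passed in by the caller, and the "appropriate factor" for medium-abundance mRNAs is assumed here to be the ratio of the YNB to the YEPD hybridization signal. The function and parameter names are hypothetical.

```python
CHIP_OVERESTIMATE = 2.2  # from the text: hybridization overestimate for rare mRNAs

def adjusted_mrna(abundance_class, sage=None, chip_ynb=None, chip_yepd=None,
                  media_differ=False):
    """Return adjusted mRNA copies per cell for one gene.

    abundance_class: "high", "medium", or "low" (assigned by the caller).
    media_differ: True when hybridization shows a significant YEPD/YNB difference.
    """
    if abundance_class == "high":
        # SAGE data are used directly.
        return float(sage)
    if abundance_class == "medium":
        if media_differ:
            # Assumed form of the correction: scale SAGE by the YNB/YEPD chip ratio.
            return sage * chip_ynb / chip_yepd
        return float(sage)
    if abundance_class == "low":
        # Chip data from YNB, normalized to the SAGE scale.
        return chip_ynb / CHIP_OVERESTIMATE
    raise ValueError(f"unknown abundance class: {abundance_class!r}")

print(adjusted_mrna("high", sage=197))                      # ADH1 SAGE value from the text
print(adjusted_mrna("low", chip_ynb=4.4))                   # hypothetical chip value
print(adjusted_mrna("medium", sage=10, chip_ynb=3, chip_yepd=6, media_differ=True))
```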

CAI. The codon adaptation index (CAI) was taken from the yeast proteome
database (YPD) (13), for which calculations were made according to Sharp and 
Li (24). Briefly, the index uses a reference set of highly expressed genes to assign 
a value to each codon, and then a score for a gene is calculated from the 
frequency of use of the various codons in that gene (24). 
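
As a rough illustration of how such an index is computed, the sketch below follows the general form of the Sharp and Li calculation: each codon receives a relative adaptiveness w (its count in the reference set divided by the count of the most used synonymous codon), and a gene's CAI is the geometric mean of w over its codons. The two-amino-acid codon table, the reference counts, and the 0.5 pseudocount for unseen codons are toy assumptions; the YPD values used in this study were not computed with this code.

```python
import math

def relative_adaptiveness(reference_counts, synonymous):
    """w for each codon: count / count of the most used synonymous codon."""
    w = {}
    for codons in synonymous.values():
        if len(codons) < 2:
            continue  # single-codon amino acids (Met, Trp) carry no information
        top = max(reference_counts.get(c, 0) for c in codons)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in codons:
            # Assumed pseudocount of 0.5 so unseen codons do not give w = 0.
            w[c] = max(reference_counts.get(c, 0), 0.5) / top
    return w

def cai(sequence, w):
    """Geometric mean of w over the codons of a coding sequence."""
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(w[c]) for c in codons if c in w]
    return math.exp(sum(logs) / len(logs))

# Toy reference set: Lys and Glu codons only (hypothetical counts).
synonymous = {"K": ["AAA", "AAG"], "E": ["GAA", "GAG"], "M": ["ATG"]}
reference_counts = {"AAG": 80, "AAA": 20, "GAA": 90, "GAG": 10}
w = relative_adaptiveness(reference_counts, synonymous)

print(cai("ATGAAGGAA", w))  # uses only preferred codons
print(cai("ATGAAAGAG", w))  # uses only rare codons
```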

Statistical analysis. The JMP program was used with the aid of T. Tully. The
JMP program showed that neither mRNA nor protein abundances were normally
distributed; therefore, Spearman rank correlation coefficients (r_s) were
calculated. The mRNA (adjusted and unadjusted) and protein data were also
transformed so that Pearson product-moment correlation coefficients (r_p) could
be calculated. First, this was done by a Box-Cox transformation of log-transformed
data. This transformation produced normal distributions, and an r_p of
0.76 was achieved. However, because the Box-Cox transformation is complex, we
also did a simpler logarithmic transformation. This produced a normal distribution
for the protein data. However, the distribution for the mRNA and adjusted
mRNA data was close to, but not quite, normal. Nevertheless, we calculated the
r_p and found that it was 0.76, identical to the coefficient from the Box-Cox
transformed data. We therefore believe that this correlation coefficient is not
misleading, despite the fact that the log(mRNA) distribution is not quite normal.
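
The two coefficients can be illustrated with a short, dependency-free sketch. The analysis in this study was done with JMP; the code below only shows what is being computed: a Spearman coefficient as the Pearson coefficient of the ranks (ties receive average ranks), and a Pearson coefficient on log-transformed values. The function names and the example numbers are hypothetical, and the Box-Cox step is not reproduced.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def log_pearson(x, y):
    """Pearson correlation after a log10 transformation of both variables."""
    return pearson([math.log10(a) for a in x], [math.log10(b) for b in y])

# Hypothetical mRNA copies per cell and protein molecules per cell (thousands).
mrna = [1, 5, 14, 36, 226]
protein = [12, 75, 40, 160, 280]
print(spearman(mrna, protein), log_pearson(mrna, protein))
```

Because the rank coefficient ignores the shape of the distributions, it is a reasonable check on the coefficient obtained from transformed data.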



RESULTS 

Visualization of 1,400 spots on three gel systems. Yeast 
proteins have isoelectric points ranging from 3.1 to 12.8, and 
masses ranging from less than 10 kDa to 470 kDa. It is difficult 
to examine all proteins on a single kind of gel, because a gel 
with the needed range in pI and mass would give poor resolution
of the thousands of spots in the central region of the gel.
Therefore, we have used three gel systems: (i) pH "4 to 8" with 
10% polyacrylamide; (ii) pH "3 to 10" with 10% polyacryl- 
amide; and (iii) nonequilibrium with 15% polyacrylamide (7, 
8). Each gel system allows good resolution of a subset of yeast 
proteins. 

Figure 1 shows a pH 4-8, 10% polyacrylamide gel. The pH 
at the basic end of the isoelectric focusing gel cannot be main- 
tained throughout focusing, and so the proteins resolved on 
such gels have isoelectric points between pH 4 and pH 6.7. For 
these pH 4-8 gels, we see 600 to 900 spots on the best gels after
multiple exposures. 

The pH 3-10 gels (not shown) extend the pI range somewhat
beyond pH 7.5, allowing detection of several hundred additional
spots. Finally, we use nonequilibrium gels with 15%
acrylamide in the second dimension. These allow visualization
of about 100 very basic proteins and about 170 small proteins
(less than 20 kDa). In total, using all three gel systems, about
1,400 spots can be seen. These represent about 1,200 different 
proteins, which is about one-quarter to one-third of the pro- 
teins expressed under these conditions (27, 30). Here, we focus 
on the proteins seen on the pH 4-8 gels. 

Although nearly all expressed proteins are present on these 
gels, the number seen is limited by a problem we call coverage. 
Since there are thousands of proteins on each gel, many pro- 
teins comigrate or nearly comigrate. When two proteins are 
resolved, but are close together, and one protein spot is much 
more intense than the other, a problem arises in visualizing the 
weaker spot: at long exposures when the weak signal is strong 
enough for detection, the signal from the strong spot spreads 
and covers the signal from the weaker spot. Thus, weak spots 
can be seen only when they are well separated from strong 
spots. 

For a given gel, the number of detectable spots initially rises 
with exposure time. However, beyond an optimal exposure, the 
number of distinguishable spots begins to decrease, because 
signals from strong spots cover signals from nearby weak spots. 
At long exposures, the whole autoradiogram turns black. Thus, 
there is an optimum exposure yielding the maximum number 
of spots, and at this exposure the weakest spots are not seen. 

Largely because of the problem of coverage, the proteins 
seen are strongly biased toward abundant proteins. All identi- 
fied proteins have a CAI of 0.18 or more, and we have iden- 
tified no transcription factors or protein kinases, which are 
nonabundant proteins. Thus, this technology is useful for ex- 
amining protein synthesis, amino acid metabolism, and glyco- 
lysis but not for examining transcription, DNA replication, or 
the cell cycle. 

Spot identification. The identification of various spots has
been described elsewhere (7, 8). At present, 169 different spots 
representing 148 proteins have been identified. Many of these 
spots have been independently identified (2, 10, 23, 25). The 
main methods used in spot identification have been analysis of 
amino acid composition, gene overexpression, peptide se- 
quencing, and mass spectrometry. 

Pulse-chase experiments and protein turnover. Pulse-chase 
experiments were done to measure protein half-lives (Materials
and Methods). Cells were labeled with [35S]methionine for
10 min, and then an excess of unlabeled methionine was added.
Samples were taken at 0, 10, 20, 30, 60, and 90 min after the
beginning of the chase. Equal amounts of 35S were loaded from
each sample; 2D gels were run, and spots were quantitated.
Surprisingly, almost every spot was nearly constant in amount
of radioactivity over the entire time course (not shown). A few
spots shifted from one position to another because of posttranslational
modifications (e.g., phosphorylation of Rpa0 and
Efb1). Thus, the proteins being visualized are all or nearly all
very stable proteins, with half-lives of more than 90 min. Gygi
et al. (10) have come to a similar conclusion by using the N-end 
rule to predict protein half-lives. This result does not imply 
that all yeast proteins are stable. The proteins being visualized 
are abundant proteins; this is partly because they are stable 
proteins. 

Protein quantitation. Because all of the proteins seen had 
effectively the same half-life, the abundance of each protein 
was directly proportional to the amount of radioactivity incorporated
during labeling. Thus, after taking into account the
total number of protein molecules per cell, the average content 
of methionine and cysteine, and the methionine and cysteine 
content of each identified protein, we could calculate the abun- 
dance of each identified protein (Tables 1 and 2; Materials and 
Methods). About 1,000 unidentified proteins were also quan- 
tified, assuming an average content of Met and Cys. 

Many proteins give multiple spots (7, 8). The contribution 
from each spot was summed to give the total protein amount. 
However, many proteins probably have minor spots that we are 
not aware of, causing the amount of protein to be underesti- 
mated. 

When the proteins on a pH 4-8 gel were ordered by abun- 
dance, the most abundant protein had 8,904 ppm, the 10th 
most abundant had 2,842 ppm, the 100th most abundant had 
314 ppm, the 500th most abundant had 57 ppm, and the 
1,000th most abundant (visualized at greater than optimum 
exposure) had 23 ppm. Thus, there is more than a 300-fold 
range in abundance among the visualized proteins. The most 
abundant 10 proteins account for about 25% of the total pro- 
tein on the pH 4-8 gel, the most abundant 60 proteins account 
for 50%, and the most abundant 500 proteins account for 80%. 
Since it seems likely that the pH 4-8 gels give a representative 
sampling of all proteins, we estimate that half of the total 
cellular protein is accounted for by fewer than 100 different 
gene products, principally glycolytic enzymes and proteins in- 
volved in protein synthesis. 

Correlation of protein abundance with mRNA abundance. 
Estimates of mRNA abundance for each gene have been made 
by SAGE (27) and by hybridization of cRNA to oligonucleo- 
tide arrays (30). These two methods give broadly similar re- 
sults, yet each method has strengths and weaknesses (Materials 
and Methods). Table 1 lists the number of molecules of mRNA 
per cell for each gene studied. One measurement (mRNA) 
uses data from SAGE analysis alone (27); a second incorpo- 
rates data from both SAGE and hybridization (30) (adjusted 
mRNA) (Table 1; Materials and Methods). We correlated 
protein abundance with mRNA abundance (Fig. 2). For adjusted
mRNA versus protein, the Spearman rank correlation
coefficient, r_s, was 0.74 (P < 0.0001), and the Pearson correlation
coefficient, r_p, on log transformed data (Materials and
Methods) was 0.76 (P < 0.00001). We obtained similar correlations
for mRNA versus protein and also for other data transformations
(Materials and Methods). Thus, several statistical
methods show a strong and significant correlation between 
mRNA abundance and protein abundance. Of course, the cor- 
relation is far from perfect; for mRNAs of a given abundance, 
there is at least a 10-fold range of protein abundance (Fig. 2).
Some of this scatter is probably due to posttranscriptional 
regulation, and some is due to errors in the mRNA or protein 
data. For example, the protein Yef3 runs poorly on our gels, 
giving multiple smeared spots. Its abundance has probably 
been underestimated, partly explaining the low protein/mRNA 
ratio of Yef3. It is the most extreme outlier in Fig. 2.

These data on mRNA (27, 30) and protein abundance (Ta- 
ble 1) suggest that for each mRNA molecule, there are on 
average 4,000 molecules of the cognate protein. For instance, 
for Actl (actin) there are about 54 molecules of mRNA per 
cell and about 205,000 molecules of protein. Assuming an 
mRNA half-life of 30 min (12) and a cell doubling time of 120 
min, this suggests that an individual molecule of mRNA might 
be translated roughly 1,000 times. These calculations are lim- 
ited to mRNAs for abundant proteins, which are likely to be 
the mRNAs that are translated best. 

A full complement of cell protein is synthesized in about 120 
min under these conditions. Thus, 4,000 molecules of protein 
per molecule of mRNA implies that translation initiates on an 
mRNA about once every 2 s. This is a remarkably high rate; it
implies that if an average mRNA bears 10 ribosomes engaged
in translation, then each ribosome completes translation in
20 s. If an average protein has 450 residues, this in turn implies
translation of over 20 amino acids per s, a rate considerably
higher than estimated for mammals (3 to 8 amino acids per
s) (18). These estimates depend on the amount of mRNA per
cell (11, 27).
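
The arithmetic behind these estimates can be checked with a few lines. The inputs are the values stated in the text (a 120-min doubling time, 4,000 protein molecules per mRNA, 10 ribosomes per mRNA, 450 residues per protein, and the Act1 numbers); the function names are hypothetical, and the text rounds the intermediate results (about 2 s and 20 s).

```python
DOUBLING_TIME_S = 120 * 60      # one full complement of protein per doubling
PROTEINS_PER_MRNA = 4000        # average ratio reported in the text
RIBOSOMES_PER_MRNA = 10         # assumed ribosome load used in the text
RESIDUES_PER_PROTEIN = 450      # assumed average protein length used in the text

def initiation_interval_s():
    """Seconds between successive initiations on one mRNA."""
    return DOUBLING_TIME_S / PROTEINS_PER_MRNA

def ribosome_transit_time_s():
    """Seconds for one ribosome to finish, given the ribosome load."""
    return initiation_interval_s() * RIBOSOMES_PER_MRNA

def elongation_rate_aa_per_s():
    """Implied elongation rate in amino acids per second."""
    return RESIDUES_PER_PROTEIN / ribosome_transit_time_s()

print(initiation_interval_s())      # about 1.8 s, i.e., roughly once every 2 s
print(ribosome_transit_time_s())    # about 18 s, i.e., roughly 20 s
print(elongation_rate_aa_per_s())   # over 20 amino acids per s

# Act1 example from the text: protein molecules per mRNA molecule.
print(205000 / 54)                  # close to the average of 4,000
```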

The large number of protein molecules that can be made 
from a single mRNA raises the issue of how abundance is 
controlled for less abundant proteins. Many nonabundant pro- 
teins may be unstable, and this would reduce the protein/ 
mRNA ratio. In addition, many nonabundant proteins may be 
translated at suboptimal rates. We have found that mRNAs for 
nonabundant proteins usually have suboptimal contexts for 
translational initiation. For example, there are over 600 yeast 
genes which probably have short open reading frames in the 
mRNA upstream of the main open reading frame (17a). These 
may be devices for reducing the amount of protein made from 
a molecule of mRNA.

Correlation of codon bias with protein abundance. The 
mRNAs for highly expressed proteins preferentially use some 
codons rather than others specifying the same amino acid (14). 
This preference is called codon bias. The codons preferred are 
those for which the tRNAs are present in the greatest amounts. 
Use of these codons may make translation faster or more 
efficient and may decrease misincorporation. These effects are 
most important for the cell for abundant proteins, and so 
codon bias is most extreme for abundant proteins. The effect 
can be dramatic: highly biased mRNAs may use only 25 of the
61 codons. 

We asked whether the correlation of codon bias with abun- 
dance continues for medium-abundance proteins. There are 
various mathematical expressions quantifying codon bias; here, 
we have used the CAI (24) (Materials and Methods) because 
it gives a result between 0 and 1. The r_s for CAI versus protein
abundance is 0.80 (P < 0.0001), similar to the mRNA-protein



TABLE 1. Quantitative data^a



Function 



Name 


CAI 


mRNA 


Adjusted mRNA




U.olU 


197 


197 




U.DU4 


0 




n 1 fi^ 
U.loj 


1 


2.8 


Enol 


U.cS/U 


NoMfl 


Eno2 




9/1Q 


248 


Fbal 


U.oOo 


1 /y 


179 




U.jUU 


13 


10.5 


Icll 


n 9^1 


U 


Pdbl 


n 149 




5 


Pdcl 


0.903 


226 


226 


Pfkl 


0.465 


5 


5 


Pgil 


0.681 


14 


14 


Pycl 


0.260 


1 


0.7 


Tall 


0.579 


5 


5 


1 anZ 


0.904 


63 


63 


Tdh3 


0.924 


460 


460 


Tpil 


0,817 


No Nla 




Efbl 


0.762 


33 


16.5 


bitl,2 


0.801 


26 


26 


Prtl 


0.303 


4 


0.7 


RpaO 


0.793 


246 


246 


Tin,2 


0.752 


29 


29 


Yef3 


0.777 


36 


36 


Hsc82 


n SRI 


£. 


T n 

z.y 




yj.joi 


Q 


2.3 


Hsp82 


U.J 1 / 


L 


1.3 


Hspl04 


0.304 


1 


7 


Kar2 


0.439 


5 


10.1 


Ssal 


0.709 


2 


4.3 


Ssa2 


0.802 


10 


5 


Ssbl,2 


0.850 


50 


50 


Sscl 


0.521 


2 


2.6 
8 


Ssel 


0.521 


8 


Stil 


0.247 


1 


1.1 


Adel 


n 99Q 


A 


4 


AdeS 


0.276 


9 


1 1 


Ade5,7 




2. 


1.4 


Arg4 


n 990 


1 
1 


D 1 
O.l 


Gdhl 


fl SR*\ 
u. JoJ 


in 


11 


Glnl 


0.524 


11 


11 


His4 


0.267 


3 


3 


Ilv5 


0.801 


6 


6 


Lys9 


0.332 


4 


4 


Met6 


0.657 


No Nla 


22 


Pro2 


0.248 


3 


3 


Serl 


0.258 


2 


1.2 


Trp5 


0.319 


5 


5 


Actl 


n 7ifi 


Sd 


j4 


Adkl 




Mr* 




Ald6 


0.520 


J 




Atp2 


0.424 


1 

1 


A 1 

4.1 


Bmhl 


0.322 


HO 


40 


Bmh2 


n "^84 

U. jO*t 


1 
1 


1 /< 
1.4 


Cdc48 


U.jUO 


£. 


LA 




n 900 


I 


0.86 


nrgzu 






5 


Gppl 


n Am 

U.OUJ 


1 £ 

lo 


5 


VJdpi 




i 


3 


Irtn1 
ippl 


n A9n 


4 


4 




n 1 77 


n '5 
U.J 


0.8 


IVIOI 1 




U 


0.45 


raU 1 
3
15
5


u 


Saml 


0494 


5 


5 


Sam2 
3


15 


Sodl 


0.376 


36 


36 


Ubal 


0.212 


2 


2 


YKL056 


0.731 


62 


62 


YLR109 


0.549 


21 


21 


YMRlie 


0.777 


41 


41 



Protein (Glu) (10^3) Protein (Eth) (10^3) E/G ratio



Carbohydrate metabolism 



Protein synthesis 



Heat shock 



Amino acid synthesis 



Miscellaneous 



1,230 
0 
23 
410 
650 
640 
62 
0 
41 
280 
75 
160 
37 
110 
430 
1,670 
No Met 

358 
99 
12 
277 
233 
14 

112 
35 
52 
70 
43 
303 
213 
270 
68 
96 
25 

14 

12 

14 

41 
148 

77 

15 
152 

32 
190 

30 

15 

28 

205 

47 
181 

76 
191 
134 

32 
6 

92 
234 
115 
254 

19 

20 

41 
148 

44 

59 

63 
631 

14 
253 
930 
184 



972 
963 
288 
974 
215 
608 

46 
671 

33 
205 

53 
120 

34 

35 
876 
1,927 
No Met 

362 
54 
6 
100 
106 
ND 

75 
82 
135 
161 
102 
421 
324 
85 
80 
48 
44 

27 
9 
4 
41 
55 
104 
23 
109 
17 
80 
12 
8 
12 

164 

43 
159 
109 
137 
147 

26 
2 

39 
158 

39 
147 

40 

16 

19 

56 

37 

21 

20 
618 

20 
112 

40 



0.79 
>20 
12 
2.4 
0.33 
0.95 

>20 

0.73 
0.71 
0.75 



NR 
NR 



0.55 

0.36 
0.46 



0.67 

2.3 

2.6 

2.3 

2.4 

1.4 

1.5 

1.2 

1.7 



1.3 

1.5 

0.7 

0.52 

0.42 



0.78 



1.4 

0.72 



0.34 
0.58 



0.47 



0.44 
0.20 



^a CAI, a measure of codon bias, is taken from the YPD. mRNA, number of mRNA molecules per cell from SAGE data (27); adjusted mRNA, number of mRNA



I ratio is 

a protein or an mRNA basis; these are pooled. No Nla, there was no suitable NlaIII site in the 3' region of the gene, and so there are no SAGE data. No Met,
the mature gene product contains no methionines, and so there are no reliable protein data.

TABLE 2. Functions of proteins listed in Table 1



Name^a

YPD title lines^b

Adh1
Adh2
Cit2
Eno1
Eno2
Fba1
Hxk1
Hxk2
Icl1
Pdb1
Pdc1
Pfk1
Pgi1
Pyc1
Tal1
Tdh2
Tdh3
Tpi1
Efb1
Eft1
Eft2
Prt1
Rpa0 (RPP0)
Tif1
Tif2
Yef3
Hsc82
Hsp60
Hsp82
Hsp104
Kar2
Ssa1
Ssa2
Ssb1
Ssb2
Ssc1
Sse1
Sti1
Ade1
Ade3
Ade5,7
Arg4
Gdh1
Gln1
His4
Ilv5
Lys9
Met6
Pro2
Ser1
Trp5
Act1
Adk1
Ald6
Atp2
Bmh1
Bmh2
Cdc48
Cdc60
Erg20
Gpp1 (Rhr2)
Gsp1

{^^^, 

Mol1 (Thi4)
Pab1
Psa1
Rnr4
Sam1
Sam2
Sod1
Uba1
YKL056
YLR109 (Ahp1)
YMR116 (Asc1)



Alcohol dehydrogenase I; cytoplasmic isozyme reducing acetaldehyde to ethanol, regenerating NAD"*" 
Alcohol dehydrogenase II; oxidizes ethanol to acetaldenyde, glucose repressed 

Citrate synthase, peroxisomal (non mitochondrial); converts acetyl-CoA and oxaloacetate to citrate plus CoA 
Enolase 1 r2-phosphoglycerate dehydratase^ converts 2-phospho-i>-gIycerate to phosphoe no! pyruvate in glycolysis 
Enolase 2 (2-phosphoglycerate dehydratase); converts 2-phospho-D-glycerate to phosphoenolpyruvate in glycolysis 
Fructose bisphospnate aldolase 11; sixth step in glycolysis 

Hexokinase I; converts hexoses to hexose phosphates in glycolysis; repressed by glucose 

Hexokinase II; converts hexoses to hexose phosphates in glycolysis and plays a regulatory role in glucose repression 
Isocitrate lyase, peroxisomal; carries out part of the glyoxylate cycle; required for gluconeogenesis 
F^ruvate dehydrogenase complex, El beta subunit 
FVruvate decarboxylase isozyme 1 

Phosphofructokinase alpha subunit, part of a complex with Pfk2p which carries out a key regulatory step in glycolysis 
GIucose-6-phosphate isomerase, converts g!ucose-6-phosphate to fructose-6-phosphate 
Pyruvate carboxylase 1; converts pyruvate to oxaloacetate for gluconeogenesis 
Transaldolase; component of nonoxidative part of pentose phosphate pathway 

GIyceraIdehyde-3-pnosphate dehydrogenase 2; converts D-glyceraldehyde 3-phosphate to 1,3-dephosphoglycerate 
GlyceraIdehyde-3-phosphate dehydrogenase 3; converts D-glyceraldehyde 3-phosphate to 1,3-dephosphoglycerate 
Tnosephosphate isomerase; interconverts glyceraidehyde-3-phosphate and dihydroxyacetone phosphate 

Translation elongation factor EF-ip; GDP/GTP exchange factor for Teflp/Tef2p 

Translation elongation factor EF-2; contains diphthamide which is not essential for activity; identical to Eft2p 
Translation elongation factor EF-2; contains diphthamide which is not essential for activity; identical to Eftlp 
Translation initiation factor elF3 beta subunit (p90); has an RNA recognition domain 
Acidic ribosomal protein AO 

Translation initiation factor 4 A feIF4AJ of the DEAD box family 
Translation initiation factor 4A (eIF4A) of the DEAD box family 
Translation elongation factor EF-3A; member of ATP-binding cassette superfamily 

Chaperonin homologous to E. coli HtpG and mammalian HSP90

Mitochondrial chaperonin that cooperates with Hsp10p; homolog of E. coli GroEL

Heat-inducible chaperonin homologous to E. coli HtpG and mammalian HSP90

Heat shock protein required for induced thermotolerance and for resolubilizing aggregates of denatured proteins; important for [psi-]-to-[PSI+] prion conversion

Heat shock protein of the endoplasmic reticulum lumen required for protein translocation across the endoplasmic reticulum membrane and for nuclear fusion; member of the HSP70 family
Cytoplasmic chaperone; heat shock protein of the HSP70 family
Cytoplasmic chaperone; member of the HSP70 family
Heat shock protein of HSP70 family involved in the translational apparatus 
Heat shock protein of HSP70 family, cytoplasmic 

Mitochondrial protein that acts as an import motor with Tim44p and plays a chaperonin role in receiving and folding of protein chains during import; heat shock protein of HSP70 family
Heat shock protein of the HSP70 family; multicopy suppressor of mutants with hyperactivated Ras/cyclic AMP pathway
Stress-induced protein required for optimal growth at high and low temperature; has tetratricopeptide repeats

Phosphoribosylamidoimidazole-succinocarboxamide synthase; catalyzes the seventh step in de novo purine biosynthesis pathway
C1-tetrahydrofolate synthase (trifunctional enzyme), cytoplasmic

Phosphoribosylamine-glycine ligase plus phosphoribosylformylglycinamidine cyclo-ligase; bifunctional protein 
Argininosuccinate lyase; catalyzes the final step in arginine biosynthesis

Glutamate dehydrogenase (NADP+); combines ammonia and α-ketoglutarate to form glutamate
Glutamine synthetase; combines ammonia with glutamate in ATP-driven reaction

Phosphoribosyl-AMP cyclohydrolase/phosphoribosyl-ATP pyrophosphohydrolase/histidinol dehydrogenase; 2nd, 3rd, and 10th steps of his biosynthesis pathway

Ketol-acid reductoisomerase (acetohydroxy acid reductoisomerase) (alpha-keto-β-hydroxylacyl reductoisomerase); second step in Val and Ile biosynthesis pathway

Saccharopine dehydrogenase (NADP+, L-glutamate forming) (saccharopine reductase), seventh step in lysine biosynthesis pathway
Homocysteine methyl transferase; (5-methyltetrahydropteroyl triglutamate-homocysteine methyltransferase), methionine synthase, 
cobalamin independent 

γ-Glutamyl phosphate reductase (phosphoglutamate dehydrogenase), proline biosynthetic enzyme
Phosphoserine transaminase; involved in synthesis of serine from 3-phosphoglycerate
Tryptophan synthase, last (5th) step in tryptophan biosynthesis pathway 

Actin; involved in cell polarization, endocytosis, and other cytoskeletal functions 
Adenylate kinase (GTP:AMP phosphotransferase), cytoplasmic 
Cytosolic acetaldehyde dehydrogenase

Beta subunit of F1-ATP synthase; 3 copies are found in each F1 oligomer
Homolog of mammalian 14-3-3 protein; has strong similarity to Bmh2p
Homolog of mammalian 14-3-3 protein; has strong similarity to Bmh1p

Protein of the AAA family of ATPases; required for cell division and homotypic membrane fusion 
Leucyl-tRNA synthetase, cytoplasmic 

Farnesyl pyrophosphate synthetase; may be rate-limiting step in sterol biosynthesis pathway 
DL-Glycerol phosphate phosphatase 

Ran, a GTP-binding protein of the Ras superfamily involved in trafficking through nuclear pores 
Inorganic pyrophosphatase, cytoplasmic 

Component of serine C-palmitoyltransferase; first step in biosynthesis of long-chain base component of sphingolipids
Thiamine-repressed protein essential for growth in the absence of thiamine

Poly(A)-binding protein of cytoplasm and nucleus; part of the 3'-end RNA-processing complex (cleavage factor I); has 4 RNA recognition domains
Mannose-1-phosphate guanyltransferase; GDP-mannose pyrophosphorylase
Ribonucleotide reductase small subunit
S-Adenosylmethionine synthetase 1
S-Adenosylmethionine synthetase 2
Copper-zinc superoxide dismutase
Ubiquitin-activating (E1) enzyme

Resembles translationally controlled tumor protein of animal cells and higher plants 
Alkyl hydroperoxide reductase 

Abundant protein with effects on translational efficiency and cell size, has two WD (WD-40) repeats 



" Accepted name from the Saccharomyces genome database and YPD. Names in parentheses represent recent changes. 
Courtesy of Proteome, Inc., reprinted with permission. 

FIG. 2. Correlation of protein abundance with adjusted mRNA abundance.
The number of molecules per cell of each protein is plotted against the number
of molecules per cell of the cognate mRNA, with a Pearson correlation coefficient (r_p) of 0.76. Note the
logarithmic axes. Data for mRNA were taken from references 27 and 30 and
combined as described in Materials and Methods.



correlation, confirming a strong correlation between CAI and 
protein abundance (Fig. 3). The relationship between CAI and 
protein abundance is log linear from about 1,000,000 to about 
10,000 molecules per cell. We have no data for rarer proteins. 

It is not clear whether CAI reflects maximum or average 
levels of protein expression. The proteins used for the CAI- 
protein correlation included some proteins which were not 
expressed at maximum levels under the conditions of the experiment
(Hsc82, Hsp104, Ssa1, Ade1, Arg4, His4, and others).
When these proteins were removed from consideration and
the correlation between CAI and the remaining (presumably
constitutive) proteins was recalculated, the correlation coefficient was essentially
unchanged (not shown).

The equation describing the graph in Fig. 3 is log (protein
molecules/cell) = (2.3 × CAI) + 3.7. Thus, under certain
conditions (a CAI of 0.3 or greater; a constitutively expressed
gene), a very rough estimate of protein abundance can be
made by raising 10 to the power of [(2.3 × CAI) + 3.7].
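
As a rough illustration of the relationship above, the following Python sketch evaluates the fitted line. The function name and the range check are illustrative assumptions and not part of the original analysis; the coefficients 2.3 and 3.7 are those quoted in the text, and the estimate is meant only for constitutively expressed genes with a CAI of 0.3 or greater.

```python
def estimate_protein_abundance(cai: float) -> float:
    """Very rough protein molecules per cell from the codon adaptation index.

    Uses the fitted line log10(molecules/cell) = 2.3 * CAI + 3.7.
    The 0.3 lower bound mirrors the stated range of validity; the check
    itself is an assumption added for this sketch.
    """
    if not 0.3 <= cai <= 1.0:
        raise ValueError("fit is only described for CAI between 0.3 and 1.0")
    return 10 ** (2.3 * cai + 3.7)


if __name__ == "__main__":
    for cai in (0.3, 0.5, 0.8, 1.0):
        print(f"CAI {cai:.1f}: about {estimate_protein_abundance(cai):,.0f} molecules/cell")
```

At a CAI of 1.0 the line gives about 1,000,000 molecules per cell, and at a CAI of 0.3 it gives roughly 25,000, which matches the range over which the relationship is reported to be log linear.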

The distribution of CAI over the genome (Fig. 4) consists of 
a lower, bell-shaped distribution, possibly indicating a region 
where there is no selection for codon bias, and an upper, flat 
distribution, starting at a CAI of about 0.3, possibly indicating 
a region where there is selection for codon bias. Almost all of 
the proteins whose abundance we have measured are in the 
upper, flat portion of the distribution. In the lower, bell-shaped 
region, we do not know whether there is a correlation between 
CAI and protein abundance. 

Changes in protein abundance in glucose and ethanol. A 
comparison of cells grown in glucose (Fig. 1A) with cells grown
in ethanol (Fig. 1B) is shown in Table 1. As is well known,
some proteins are induced tremendously during growth on
ethanol. Two striking examples are the peroxisomal enzymes
Icl1 (isocitrate lyase) and Cit2 (citrate synthase), which are
induced in ethanol by more than 100- and 12-fold, respectively 
(Fig. 1; Table 1). These enzymes are key components of the 
glyoxylate shunt, which diverts some acetyl coenzyme A 
(acetyl-CoA) from the tricarboxylic acid cycle to gluconeogenesis.
S. cerevisiae requires large amounts of carbohydrate for its
cell wall; in ethanol medium, this carbohydrate comes from
gluconeogenesis, which depends on the glyoxylate shunt and
on the glycolytic pathway running in reverse. The need for
gluconeogenesis also explains why glycolytic enzymes are
abundant even in ethanol medium. Thus, 2D gel analysis shows 
the prominence of the glycolytic and glyoxylate shunt enzymes 
in cells grown on ethanol, emphasizing that gluconeogenesis, 
presumably largely for production of the cell wall, is a major 
metabolic activity under these conditions. 

During gluconeogenesis, substrate-product relationships are 
reversed for the glycolytic enzymes. One might expect that not 
all glycolytic enzymes would be well adapted to the reverse 
reaction. Indeed, 2D gels show that in ethanol, Adh2 (alcohol
dehydrogenase 2) is strongly induced (16), while its isozyme
Adh1 is not greatly affected. Adh1 and Adh2 each interconvert
acetaldehyde and ethanol. Adh1 has a relatively high K_m for
ethanol (17 mM), while Adh2 has a lower K_m (0.8 mM) (5).
Thus, it is thought that Adh1 is specialized for glycolysis
(acetaldehyde to ethanol), while Adh2 is specialized for respiration
(ethanol to acetaldehyde) (5, 29). Similarly, Eno1 (enolase
1) is induced in ethanol, while its isozyme Eno2 (enolase 2)
decreases in abundance (Table 1) (4, 19). Eno1 is inhibited by
2-phosphoglycerate (the glycolytic substrate), while Eno2 is
inhibited by phosphoenolpyruvate (the gluconeogenic substrate)
(4). Perhaps Eno1 has a lower K_m for phosphoenolpyruvate
than does Eno2, though to our knowledge this has not
been tested. Thus, the 2D gels distinguish isozymes specialized
for growth on glucose (Adh1 and Eno2) from isozymes specialized
for ethanol (Adh2 and Eno1).
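
To make the consequence of the two ethanol affinities concrete, the sketch below computes the fraction of maximal velocity each alcohol dehydrogenase isozyme would reach at a given ethanol concentration. It assumes simple single-substrate Michaelis-Menten kinetics and ignores the cofactor and any product inhibition, which is a simplification introduced here, not a model used in the study. The function name and the 1 mM example concentration are hypothetical; only the two affinity constants (17 mM and 0.8 mM) come from the text.

```python
def fraction_of_vmax(substrate_mM: float, km_mM: float) -> float:
    """Michaelis-Menten saturation: v / Vmax = [S] / (Km + [S])."""
    if substrate_mM < 0 or km_mM <= 0:
        raise ValueError("concentration must be >= 0 and Km must be > 0")
    return substrate_mM / (km_mM + substrate_mM)


# Affinity constants for ethanol quoted in the text (reference 5).
KM_ETHANOL_MM = {"Adh1": 17.0, "Adh2": 0.8}

if __name__ == "__main__":
    ethanol_mM = 1.0  # hypothetical concentration, chosen only for illustration
    for isozyme, km in KM_ETHANOL_MM.items():
        frac = fraction_of_vmax(ethanol_mM, km)
        print(f"{isozyme}: {frac:.1%} of Vmax at {ethanol_mM} mM ethanol")
```

Under these assumptions, at low ethanol concentrations the low-K_m isozyme operates much closer to saturation than the high-K_m isozyme, which is in line with the suggested specialization of Adh2 for ethanol oxidation.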

Many heat shock proteins (e.g., Hsp60, Hsp82, Hsp104, and
Kar2) were about twofold more abundant in ethanol medium 
than in glucose medium. This is consistent with the increased 
heat resistance of cells grown in ethanol (3). 

Enzymes involved in protein synthesis (Eft1, Rpa0, and Tif1)
were about twice as abundant in glucose medium as in ethanol 
medium. This may reflect the higher growth rate of the cells in 
glucose. 

Phosphorylation of proteins. To examine protein phosphorylation,
we labeled cells with ³²P and ran 2D gels to examine
phosphoproteins. About 300 distinct spots, probably representing
150 to 200 proteins, could be seen on pH 4-8 gels (Fig. 5B).
We then aligned autoradiograms of three gels, each with a
different kind of labeled protein (³²P only [Fig. 5B], ³²P plus
³⁵S [Fig. 5A], and ³⁵S only [not shown, but see Fig. 1 for
example]). In this way, we made provisional identification of

FIG. 3. Correlation of protein abundance with CAI. The number of molecules
per cell of each protein is plotted against the CAI for that protein. Note the
logarithmic scale on the protein axis. Data for the CAI are from the YPD
database (13).

FIG. 4. Distribution of CAI over the whole genome, shown in intervals of 0.030 (i.e., there are 150 genes with a CAI between 0.000 and 0.030, inclusive; 31 genes
with a CAI between 0.031 and 0.060; 269 genes with a CAI between 0.061 and 0.090; 1,296 genes with a CAI between 0.091 and 0.120; etc.). The distribution peaks
with 2,028 genes with a CAI between 0.121 and 0.150.



some of the ³²P-labeled spots as particular ³⁵S-labeled spots.
All such identifications are somewhat uncertain, since precise
alignments are difficult, and of course multiple spots may
exactly comigrate. Nevertheless, we believe that most of the
provisional identifications are probably correct. Among the
major ³²P-labeled proteins are the hexokinases Hxk1 and
Hxk2, the acidic ribosome-associated protein Rpa0, the translation
factors Yef3 and Efb1, and probably Hsp70 heat shock
proteins of the Ssa and Ssb families. Rpa0 and Efb1 are quantitatively
monophosphorylated.

Many yeast proteins resolve into multiple spots on these 2D
gels (7). Yef3 has five or more spots, at least four of which
comigrate with ³²P. Tpi1 has a major spot showing no ³²P
labeling and a minor, more acidic spot which overlaps with
some ³²P label. Tif1 has at least seven spots (7); two of these
overlap with some ³²P label, but five do not (Fig. 5). Eft1 has
at least three spots (7), and none of these overlap with ³²P,
although there are three nearby, unidentified ³²P-labeled spots
(a, c, and d in Fig. 5). Spots that seem to be extra forms of
Met6, Pdc1, Eno2, and Fba1 can be seen in Fig. 6A, but there
is little ³²P at these positions in Fig. 5. Thus, phosphorylation
explains some but not all of the different protein isoforms seen.

The cell cycle is regulated in part by phosphorylation. We
compared ³²P-labeled proteins from cells synchronized in G1
with α-factor, from cells synchronized in G1 by depletion of G1
cyclins, and from cells synchronized in M phase with nocodazole.
Only very minor differences were seen, and these were difficult
to reproduce. The cell cycle proteins regulated by phosphorylation
may not be abundant enough for this technique to be
applied easily.

Centrifugal fractionation. We fractionated ³⁵S-labeled extracts
by centrifugation (Materials and Methods). Figure 6A
shows the proteins in the supernatant of a high-speed
(100,000 × g, 30 min) centrifugation, while Fig. 6B shows the
proteins in the pellet of a low-speed (16,000 × g, 10 min)
centrifugation. Many proteins are tremendously enriched in 
one fraction or the other, while others are present in both. 



Most glycolytic enzymes (e.g., Tdh2, Tdh3, Eno2, Pdc1, Adh1,
and Fba1) are enriched in the supernatant fraction. The only
exception is Pfk1 (not indicated), which is found in both pellet
and supernatant fractions. Many proteins involved in protein
synthesis (Eft1, Yef3, Prt1, Tif1, and Rpa0) are in the pellet,
possibly because of the association of ribosomes with the endoplasmic
reticulum. However, Efb1 is in the supernatant, as is
a substantial portion of the Eft1. Perhaps surprisingly, several
mitochondrial proteins (Atp2 [not shown] and Ilv5) are largely
in the supernatant. Perhaps glass bead breakage of cells releases
mitochondrial proteins. The nuclear protein Gsp1 is in
the pellet fraction. The enrichment produced by centrifugation
makes it possible to see minor spots which are otherwise poorly
resolved from surrounding proteins. Figure 6B shows that the
previously identified Tif1 spot is surrounded by as many as six
other spots that cofractionate. We observed six identical or
very similar additional spots when we overexpressed Tif1 from
a high-copy-number plasmid (not shown). Signal overlaps only
one or two of these spots in ³²P-labeling experiments (Fig. 5),
and so the different forms are not mainly due to different
phosphorylation states.

DISCUSSION 

Our experience with developing a 2D gel protein database 
for S. cerevisiae is summarized here. With current technology, 
we can see the most abundant 1,200 proteins, which is about 
one-third to one-quarter of the proteins expressed. The re- 
maining proteins will be difficult to see and study with the 
methods that we have used, not because of a lack of sensitivity 
but because weak spots are covered by nearby strong spots. 

Of the 1,200 proteins seen, we have identified 148, with a 
bias toward the most abundant proteins. Steady application of 
the methods already used would allow identification of most of 
the remaining proteins. Gene overexpression will be particu- 
larly useful, since it is not affected by the lower abundance of 
the remaining visible proteins. 

FIG. 5. Phosphorylated proteins. (A) Mixture of ³²P-labeled proteins and ³⁵S-labeled proteins. Two separate labeling reactions were done, one with ³²P and one
with ³⁵S, and extracts were mixed and run on a 2D gel. Spots marked with numbers rather than gene names represent spots noted on ³⁵S gels but unidentified. Spots
labeling with ³²P were identified by (i) increased labeling compared to the ³⁵S-only gel (not shown); (ii) the characteristic fuzziness of a ³²P-labeled spot; and (iii) the
decay of signal intensity seen on exposures made 4 weeks later (not shown). A minor form of Tpi1 and at least six minor forms of Tif1 have been noted in overexpression
experiments (see also Fig. 6B); positions of the minor forms are indicated by circles. (B) ³²P-only labeling. The major form of Tpi1, which is not labeled with ³²P, is
indicated by a large circle; positions of seven forms of Tif1 are indicated by smaller circles.

FIG. 6. Fractionation by centrifugation. (A) Proteins in the supernatant of a 100,000 × g, 30-min spin; (B) proteins in the pellet of a 16,000 × g, 10-min spin. Supernatant
fractions examined in multiple experiments done over a wide range of g forces looked similar to each other, as did the pellet fractions.


2D gels of the kind that we have used are not suitable for
visualization of rare proteins. However, it will be possible to
study on a global basis metabolic processes involving relatively
abundant proteins, such as protein synthesis, glycolysis, gluconeogenesis,
amino acid synthesis, cell wall synthesis, nucleotide
synthesis, lipid metabolism, and the heat shock response.

Gygi et al. (10) have recently completed a study similar to
ours. Despite generating broadly similar data, Gygi et al. 
reached markedly different conclusions. We believe that both 
mRNA abundance and codon bias are useful predictors of 
protein abundance. However, Gygi et al. feel that mRNA 
abundance is a poor predictor of protein abundance and that 
"codon bias is not a predictor of either protein or mRNA 
levels" (10). These different conclusions are partly a matter of 
viewpoint. Gygi et al. focus on the fact that the correlations of 
mRNA and codon bias with protein abundance are far from 
perfect, while we focus on the fact that, considering the wide 
range of mRNA and protein abundance and the undoubted 
presence of other mechanisms affecting protein abundance, 
the correlations are quite good. 

However, the different conclusions are also partly due to
different methods of statistical analysis and to real differences
in data. With respect to statistics, Gygi et al. used the Pearson
product-moment correlation coefficient (r_p) to measure the
covariance of mRNA and protein abundance. Depending on
the subset of data included, their r_p values ranged from 0.1 to
0.94. Because of the low r_p values with some subsets of the
data, Gygi et al. concluded that the correlation of mRNA to
protein was poor. However, the r_p correlation is a parametric
statistic and so requires variates following a bivariate normal
distribution; that is, it would be valid only if both mRNA and
protein abundances were normally distributed. In fact, both
distributions are very far from normal (data not shown), and so
a calculation of r_p is inappropriate. There was no statistical
backing for the assertion that codon bias fails to predict protein
abundance.

We have taken two statistical approaches. First, we have
used the Spearman rank correlation coefficient (r_s). Since this
statistic is nonparametric, there is no requirement for the data
to be normally distributed. Using the r_s, we find that mRNA
abundance is well correlated with protein abundance (r_s =
0.74), and the CAI is also well correlated with protein abundance
(r_s = 0.80) (and also with mRNA abundance [data not
shown]). For the data of Gygi et al. (10), we obtained similar
results, though with their data the correlation is not as good:
r_s = 0.59 for the mRNA-to-protein correlation, and r_s = 0.59 for
the codon bias-to-protein correlation.
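
For readers unfamiliar with the Spearman rank correlation coefficient, the following Python sketch shows how it can be computed: each variable is replaced by its ranks (ties receive the average rank), and an ordinary product-moment correlation is then taken on the ranks. This is a generic textbook construction, not the software used in this study, and the mRNA and protein values in the example are invented purely for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical molecules per cell, NOT data from either study.
    mrna = [0.5, 2.0, 10.0, 40.0, 150.0]
    protein = [3_000, 9_000, 30_000, 200_000, 900_000]
    print("rank correlation:", spearman(mrna, protein))
    print("product-moment correlation on raw values:", pearson(mrna, protein))
```

Because only the ordering of the values enters the calculation, the rank statistic is unaffected by the strongly skewed distributions of mRNA and protein abundance.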

In a second approach, we transformed the mRNA and protein
data to forms where they were normally distributed, to
allow calculation of an r_p (Materials and Methods). Two transformations,
Box-Cox and logarithmic, were used; both gave
good correlations with our data [e.g., r_p = 0.76 for log(adjusted
RNA) to log(protein)]. We were not able to transform the data
of Gygi et al. to a normal distribution.
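
The logarithmic variant of this second approach can be sketched as follows: both variables are log-transformed, and the product-moment correlation is computed on the transformed values. The sketch covers only the logarithmic transformation (Box-Cox is not shown) and does not test for normality, which a real analysis would need to do. The input values are synthetic and deliberately follow an exact power law, so they say nothing about the actual data.

```python
from math import log10, sqrt


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def log_pearson(mrna, protein):
    """Correlation of log10(mRNA) with log10(protein); inputs must be > 0."""
    if min(mrna) <= 0 or min(protein) <= 0:
        raise ValueError("log transform requires strictly positive values")
    return pearson([log10(v) for v in mrna], [log10(v) for v in protein])


if __name__ == "__main__":
    # Synthetic example only: protein follows an exact power law of mRNA.
    mrna = [1.0, 10.0, 100.0, 1000.0]
    protein = [4000.0 * m ** 2 for m in mrna]
    print("correlation on raw values:", pearson(mrna, protein))
    print("correlation on log values:", log_pearson(mrna, protein))
```

On the raw scale the largest values dominate the statistic, whereas on the log scale every order of magnitude contributes comparably, which is why the transformation matters for data spanning several orders of magnitude.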

Finally, there are also some differences in data between the 
two studies. These may be partly due to the different measure- 
ment techniques used: Gygi et al. measured protein abundance 
by cutting spots out of gels and measuring the radioactivity in 
each spot by scintillation counting, whereas we used phospho- 
rimaging of intact gels coupled to image analysis. We com- 
pared our data to theirs for the proteins common between the 
studies (but excluding proteins whose mRNAs are known to 
differ between rich and minimal media, and excluding Tif1,
which was anomalous in differing by 100-fold between the two
data sets). The correlation coefficient between the two protein data sets was 0.88
(P < 0.0001). Although this is a strong correlation, the fact that
it is less than 1.0 suggests that there may have been errors in
measuring protein abundance in one or both studies. After 
normalizing the two data sets to assume the same amount of 
protein per cell, we found a systematic tendency for the protein 
abundance data of Gygi et al. to be slightly higher than ours for 
the highest-abundance proteins and also for the lowest-abun- 
dance proteins but slightly lower than ours for the middle- 
abundance proteins. These systematic differences suggest some 
systematic errors in protein measurement. Although we do not 
know what the errors are, we suggest the following as a rea- 
sonable speculation. For the highest-abundance proteins, we 
may have underestimated the amount of protein because of a 
slightly nonlinear response of the phosphorimager screens. For 
the lowest-abundance proteins, Gygi et al. may have overesti- 
mated the amount of protein because of difficulties in accu- 
rately cutting very small spots out of the gel and because of 
difficulties in background subtraction for these small, weak 
spots. The difference in the middle abundance proteins may be 
a consequence of normalization, given the two errors above. 

The low-abundance proteins in the data set of Gygi et al. 
have a poor correlation with mRNA abundance. We calculate
that the correlation coefficient is 0.74 for the top 54 proteins of Gygi et al. but only
0.22 for the bottom 53 proteins, a statistically significant difference.
However, with our data set, the correlation coefficient is 0.62 for the top 33
proteins and 0.56 (not significantly different) for the bottom 33
proteins (which are comparable in abundance to the bottom 
53 proteins of Gygi et al.). Thus, our data set maintains a good 
correlation between mRNA and protein abundance even at 
low protein abundance. This is consistent with our speculation 
that protein quantification by phosphorimaging and image 
analysis may be more accurate for small, weak spots than is 
cutting out spots followed by scintillation counting. Our rela- 
tively good correlations even for nonabundant proteins may 
also reflect the fact that we used both SAGE data and RNA 
hybridization data, which is most helpful for the least abundant 
mRNAs. In summary, we feel that the poor correlation of 
protein to mRNA for the nonabundant proteins of Gygi et al. 
may reflect difficulty in accurately measuring these nonabun- 
dant proteins and mRNAs, rather than indicating a truly poor 
correlation in vivo. It is not surprising that observed correla- 
tions would be poorer with less-abundant proteins and 
mRNAs, simply because the accuracy of measurement would 
be worse. 

How well can mRNA abundance predict protein abundance?
With r_p = 0.76 for logarithmically transformed mRNA
and protein data, the coefficient of determination, (r_p)², is 0.58.
This means that more than half (in log space) of the variation
in protein abundance is explained by variation in mRNA abundance.
When converted back to arithmetic values, protein
abundances vary over about 200-fold (Table 1), and (r_p)² =
0.58 for the log data means that of this 200-fold variation,
about 20-fold is explained by variation in the abundance of
mRNA and about 10-fold is unexplained (but could be due
partly to measurement errors). For proteins much less abundant
than those considered here, we imagine the in vivo correlation
between mRNA and protein abundance will be worse,
and other regulatory mechanisms such as protein turnover will
be more important.
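
The arithmetic in the preceding paragraph can be reproduced under one assumption that the text does not spell out: that the 200-fold range is split on the logarithmic scale in proportion to the coefficient of determination. The sketch below follows that reading; the function name is hypothetical, and the interpretation of how the fold values were derived is an assumption of this example.

```python
from math import log10


def partition_fold_range(total_fold: float, r: float):
    """Split a fold range on the log scale by the coefficient of determination.

    Returns (r_squared, explained_fold, unexplained_fold), where
    explained_fold * unexplained_fold equals total_fold.
    """
    r_squared = r * r
    log_total = log10(total_fold)
    explained_fold = 10 ** (r_squared * log_total)
    unexplained_fold = 10 ** ((1 - r_squared) * log_total)
    return r_squared, explained_fold, unexplained_fold


if __name__ == "__main__":
    r2, explained, unexplained = partition_fold_range(200, 0.76)
    print(f"coefficient of determination: {r2:.2f}")
    print(f"explained by mRNA: about {explained:.0f}-fold")
    print(f"unexplained: about {unexplained:.0f}-fold")
```

With a correlation of 0.76 and a 200-fold range, this gives a coefficient of determination of about 0.58, roughly 21-fold explained, and roughly 9-fold unexplained, in agreement with the rounded figures of about 20-fold and about 10-fold.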

Some important conclusions can be drawn from this sam- 
pling of the proteome. First, there is an enormous range of 
protein abundance, from nearly 2,000,000 molecules per cell 
for some glycolytic enzymes to about 100 per cell for some cell 
cycle proteins (26a). Second, about half of all cellular protein 
is found in fewer than 100 different gene products, which are 
mostly involved in carbohydrate metabolism or protein synthesis.
Third, the correlation between protein abundance and CAI
is log linear as far as we can see, which is from about 10,000 
protein molecules per cell to about 1,000,000. This is somewhat 
surprising, because it implies that selective forces for codon 
bias are significant even at moderate expression levels. It also 
means that codon bias is a useful predictor of protein abun- 
dance even for moderately low bias proteins. Fourth, there is a 
good correlation between protein abundance and mRNA 
abundance for the proteins that we have studied. This validates 
the use of mRNA abundance as a rough predictor of protein 
abundance, at least for relatively abundant proteins. Fifth, for 
these abundant proteins, there are about 4,000 molecules of 
protein for each molecule of mRNA. This last conclusion 
raises questions as to how the levels of nonabundant proteins 
are regulated and suggests that protein instability, regulated 
translation, suboptimal rates of translation, and other mecha- 
nisms in addition to transcriptional control may be very impor- 
tant for these proteins. 